Introduction

This IPython notebook explains a basic workflow two tables using py_entitymatching. Our goal is to come up with a workflow to match DBLP and ACM datasets. Specifically, we want to achieve precision greater than 95% and get recall greater than 90%. The datasets contain information about the conference papers published in top databse conferences.

First, we need to import py_entitymatching package and other libraries as follows:



In [1]:

    
import sys
sys.path.append('/Users/pradap/Documents/Research/Python-Package/anhaid/py_entitymatching/')

import py_entitymatching as em
import pandas as pd
import os



In [2]:

    
# Display the versions
print('python version: ' + sys.version )
print('pandas version: ' + pd.__version__ )
print('magellan version: ' + em.__version__ )









    



python version: 2.7.13 | packaged by conda-forge | (default, May  2 2017, 13:29:36) 
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]
pandas version: 0.20.3
magellan version: 0.2.0

Matching two tables typically consists of the following three steps:

1. Reading the input tables

2. Blocking the input tables to get a candidate set

3. Matching the tuple pairs in the candidate set

Read Input Tables

We begin by loading the input tables. For the purpose of this guide, we use the datasets that are included with the package.



In [3]:

    
# Get the paths
path_A = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'dblp_demo.csv'
path_B = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'acm_demo.csv'



In [4]:

    
# Load csv files as dataframes and set the key attribute in the dataframe
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')



In [5]:

    
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e the cartesian product): ' + str(len(A)*len(B)))









    



Number of tuples in A: 1800
Number of tuples in B: 1813
Number of tuples in A X B (i.e the cartesian product): 3263400



In [6]:

    
A.head(2)









    Out[6]:







  
    
      
      id
      title
      authors
      venue
      paper year
    
  
  
    
      0
      l0
      Paradise: A Database System for GIS Applications
      Paradise Team
      SIGMOD Conference
      1995
    
    
      1
      l1
      A Query Language and Optimization Techniques for Unstructured Data
      Gerd G. Hillebrand, Peter Buneman, Susan B. Davidson, Dan Suciu
      SIGMOD Conference
      1996



In [7]:

    
B.head(2)









    Out[7]:







  
    
      
      id
      title
      authors
      venue
      paper year
    
  
  
    
      0
      r0
      An efficient bitmap encoding scheme for selection queries
      Chee-Yong Chan, Yannis E. Ioannidis
      International Conference on Management of Data
      1999
    
    
      1
      r1
      Integrating a Structured-Text Retrieval System with an Object-Oriented Database System
      Tak W. Yan, Jurgen Annevelink
      Very Large Data Bases
      1994



In [8]:

    
# Display the key attributes of table A and B.
em.get_key(A), em.get_key(B)









    Out[8]:





('id', 'id')

Block Tables To Get Candidate Set

Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the input tables. This would reduce the number of tuple pairs considered for matching. py_entitymatching provides four different blockers: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. The user can mix and match these blockers to form a blocking sequence applied to input tables.

For the matching problem at hand, we know that two conference papers published in different years cannot match, or if there are errors in the year then there should be at least some overlap between the paper titles. So we decide the apply the following blocking plan:



In [9]:

    
# Blocking plan

# A, B -- AttrEquivalence blocker [year] --------------------|
#                                                           |---> candidate set
# A, B -- Overlap blocker [title]---------------------------|



In [10]:

    
# Create attribute equivalence blocker
ab = em.AttrEquivalenceBlocker()
# Block tables using 'year' attribute : same year include in candidate set
C1 = ab.block_tables(A, B, 'paper year', 'paper year', 
                     l_output_attrs=['title', 'authors', 'paper year'],
                     r_output_attrs=['title', 'authors', 'paper year']
                    )
len(C1)









    Out[10]:





327703



In [13]:

    
# Initialize overlap blocker
ob = em.OverlapBlocker()
# Block over title attribute
C2 = ob.block_tables(A, B, 'title', 'title', show_progress=False, overlap_size=2)
len(C2)









    Out[13]:





244220



In [14]:

    
# Combine the outputs from attr. equivalence blocker and overlap blocker
C = em.combine_blocker_outputs_via_union([C1, C2])
len(C)









    Out[14]:





544632

Match Tuple Pairs in Candidate Set

In this step, we would want to match the tuple pairs in the candidate set. Specifically, we use learning-based method for matching purposes. This typically involves the following four steps:

Sampling and labeling the candidate set
Splitting the labeled data into development and evaluation set
Selecting the best learning based matcher using the development set
Evaluating the selected matcher using the evaluation set

Sampling and labeling the candidate set

First, we randomly sample 450 tuple pairs for labeling purposes.



In [15]:

    
# Sample  candidate set
S = em.sample_table(C, 450)

Next, we label the sampled candidate set. Specify we would enter 1 for a match and 0 for a non-match.



In [16]:

    
# Label S
#G = em.label_table(S, 'label')

For the purposes of this guide, we will load in a pre-labeled dataset (of 450 tuple pairs) included in this package.



In [17]:

    
# Load the pre-labeled data
path_G = em.get_install_path() + os.sep + 'datasets' + os.sep + 'end-to-end' + os.sep + 'labeled_data_demo.csv'
G = em.read_csv_metadata(path_G, 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
len(G)









    



No handlers could be found for logger "py_entitymatching.io.parsers"






    Out[17]:





450

Splitting the labeled data into development and evaluation set

In this step, we split the labeled data into two sets: development (I) and evaluation (J). Specifically, the development set is used to come up with the best learning-based matcher and the evaluation set used to evaluate the selected matcher on unseen data.



In [18]:

    
# Split S into development set (I) and evaluation set (J)
IJ = em.split_train_test(G, train_proportion=0.7, random_state=0)
I = IJ['train']
J = IJ['test']

Selecting the best learning-based matcher

Selecting the best learning-based matcher typically involves the following steps:

Creating a set of learning-based matchers
Creating features
Converting the development set into feature vectors
Selecting the best learning-based matcher using k-fold cross validation

Creating a Set of Learning-based Matchers



In [19]:

    
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')

Creating Features

Next, we need to create a set of features for the development set. py_entitymatching provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.



In [21]:

    
# Generate features
feature_table = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)



In [22]:

    
# List the names of the features generated
feature_table['feature_name']









    Out[22]:





0                          id_id_lev_dist
1                           id_id_lev_sim
2                               id_id_jar
3                               id_id_jwn
4                               id_id_exm
5                   id_id_jac_qgm_3_qgm_3
6             title_title_jac_qgm_3_qgm_3
7         title_title_cos_dlm_dc0_dlm_dc0
8                         title_title_mel
9                    title_title_lev_dist
10                    title_title_lev_sim
11        authors_authors_jac_qgm_3_qgm_3
12    authors_authors_cos_dlm_dc0_dlm_dc0
13                    authors_authors_mel
14               authors_authors_lev_dist
15                authors_authors_lev_sim
16              paper_year_paper_year_exm
17              paper_year_paper_year_anm
18         paper_year_paper_year_lev_dist
19          paper_year_paper_year_lev_sim
Name: feature_name, dtype: object

Converting the Development Set to Feature Vectors



In [23]:

    
# Convert the I into a set of feature vectors using F
H = em.extract_feature_vecs(I, 
                            feature_table=feature_table, 
                            attrs_after='label',
                            show_progress=False)



In [24]:

    
# Display first few rows
H.head(3)









    Out[24]:







  
    
      
      _id
      ltable_id
      rtable_id
      id_id_lev_dist
      id_id_lev_sim
      id_id_jar
      id_id_jwn
      id_id_exm
      id_id_jac_qgm_3_qgm_3
      title_title_jac_qgm_3_qgm_3
      ...
      authors_authors_jac_qgm_3_qgm_3
      authors_authors_cos_dlm_dc0_dlm_dc0
      authors_authors_mel
      authors_authors_lev_dist
      authors_authors_lev_sim
      paper_year_paper_year_exm
      paper_year_paper_year_anm
      paper_year_paper_year_lev_dist
      paper_year_paper_year_lev_sim
      label
    
  
  
    
      221
      221
      l1114
      r597
      5
      0.0
      0.0
      0.0
      0
      0.0
      0.057692
      ...
      0.035294
      0.0
      0.548358
      41.0
      0.226415
      1
      1.0
      0.0
      1.0
      0
    
    
      439
      439
      l1577
      r688
      5
      0.0
      0.0
      0.0
      0
      0.0
      0.052980
      ...
      0.088235
      0.0
      0.536692
      61.0
      0.140845
      1
      1.0
      0.0
      1.0
      0
    
    
      191
      191
      l1174
      r588
      5
      0.0
      0.0
      0.0
      0
      0.0
      0.072464
      ...
      0.044444
      0.0
      0.515819
      45.0
      0.166667
      1
      1.0
      0.0
      1.0
      0
    
  

3 rows × 24 columns

Selecting the Best Matcher Using Cross-validation

Now, we select the best matcher using k-fold cross-validation. For the purposes of this guide, we use five fold cross validation and use 'precision' and 'recall' metric to select the best matcher.



In [27]:

    
# Select the best ML matcher using CV
result = em.select_matcher([dt, rf, svm, ln, lg], table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
        k=5,
        target_attr='label', metric_to_select_matcher='precision', random_state=0)
result['cv_stats']









    Out[27]:







  
    
      
      Matcher
      Average precision
      Average recall
      Average f1
    
  
  
    
      0
      DecisionTree
      0.905960
      0.922281
      0.909402
    
    
      1
      RF
      1.000000
      0.922281
      0.957971
    
    
      2
      SVM
      1.000000
      0.842432
      0.911244
    
    
      3
      LinReg
      1.000000
      0.912281
      0.951622
    
    
      4
      LogReg
      0.988889
      0.922281
      0.952866



In [29]:

    
# Select the best ML matcher using CV
result = em.select_matcher([dt, rf, svm, ln, lg], table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
        k=5,
        target_attr='label', metric_to_select_matcher='recall', random_state=0)
result['cv_stats']









    Out[29]:







  
    
      
      Matcher
      Average precision
      Average recall
      Average f1
    
  
  
    
      0
      DecisionTree
      0.905960
      0.922281
      0.909402
    
    
      1
      RF
      1.000000
      0.922281
      0.957971
    
    
      2
      SVM
      1.000000
      0.842432
      0.911244
    
    
      3
      LinReg
      1.000000
      0.912281
      0.951622
    
    
      4
      LogReg
      0.988889
      0.922281
      0.952866

We observe that the best matcher (RF) is getting us to the precision and recall that we expect (i.e P > 95% and R > 90%). So, we select this matcher and now we can proceed on to evaluating the best matcher on the unseen data (the evaluation set).

Evaluating the Matching Output

Evaluating the matching outputs for the evaluation set typically involves the following four steps:

Converting the evaluation set to feature vectors
Training matcher using the feature vectors extracted from the development set
Predicting the evaluation set using the trained matcher
Evaluating the predicted matches

Converting the Evaluation Set to Feature Vectors

As before, we convert to the feature vectors (using the feature table and the evaluation set)



In [30]:

    
# Convert J into a set of feature vectors using feature table
L = em.extract_feature_vecs(J, feature_table=feature_table,
                            attrs_after='label', show_progress=False)

Training the Selected Matcher

Now, we train the matcher using all of the feature vectors from the development set. For the purposes of this guide we use random forest as the selected matcher.



In [31]:

    
# Train using feature vectors from I 
dt.fit(table=H, 
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], 
       target_attr='label')

Predicting the Matches

Next, we predict the matches for the evaluation set (using the feature vectors extracted from it).



In [33]:

    
# Predict on L 
predictions = dt.predict(table=L, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], 
              append=True, target_attr='predicted', inplace=False)

Evaluating the Matching Output

Finally, we evaluate the accuracy of predicted outputs



In [34]:

    
# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'label', 'predicted')
em.print_eval_summary(eval_result)









    



Precision : 95.12% (39/41)
Recall : 97.5% (39/40)
F1 : 96.3%
False positives : 2 (out of 41 positive predictions)
False negatives : 1 (out of 94 negative predictions)

	id	title	authors	venue	paper year
0	l0	Paradise: A Database System for GIS Applications	Paradise Team	SIGMOD Conference	1995
1	l1	A Query Language and Optimization Techniques for Unstructured Data	Gerd G. Hillebrand, Peter Buneman, Susan B. Davidson, Dan Suciu	SIGMOD Conference	1996

	id	title	authors	venue	paper year
0	r0	An efficient bitmap encoding scheme for selection queries	Chee-Yong Chan, Yannis E. Ioannidis	International Conference on Management of Data	1999
1	r1	Integrating a Structured-Text Retrieval System with an Object-Oriented Database System	Tak W. Yan, Jurgen Annevelink	Very Large Data Bases	1994

	_id	ltable_id	rtable_id	id_id_lev_dist	title_title_jac_qgm_3_qgm_3	...	authors_authors_jac_qgm_3_qgm_3	authors_authors_mel	authors_authors_lev_dist	authors_authors_lev_sim	paper_year_paper_year_exm	paper_year_paper_year_anm	paper_year_paper_year_lev_sim
221	221	l1114	r597	5	0.057692	...	0.035294	0.548358	41.0	0.226415	1	1.0	1.0
439	439	l1577	r688	5	0.052980	...	0.088235	0.536692	61.0	0.140845	1	1.0	1.0
191	191	l1174	r588	5	0.072464	...	0.044444	0.515819	45.0	0.166667	1	1.0	1.0

	Matcher	Average precision	Average recall	Average f1
0	DecisionTree	0.905960	0.922281	0.909402
1	RF	1.000000	0.922281	0.957971
2	SVM	1.000000	0.842432	0.911244
3	LinReg	1.000000	0.912281	0.951622
4	LogReg	0.988889	0.922281	0.952866